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I A stochastic epidemic model is defined in which each individual belongs to a house- 

hold, a secondary grouping (typically school or workplace) and also the community 
as a whole. Moreover, infectious contacts take place in these three settings according 
to potentially different rates. For this model we consider how different kinds of data 
p ^ ■ can be used to estimate the infection rate parameters with a view to understanding 

. what can and cannot be inferred, and with what precision. Among other things we 

find that temporal data can be of considerable inferential benefit compared to final 
' size data, that the degree of heterogeneity in the data can have a considerable effect 

. on inference for non-household transmission, and that inferences can be materially 

different from those obtained from a model with two levels of mixing. 
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Classical early work in mathematical epidemic modelling usually assumed a homoge- 
^ . neously mixing community of individuals, each having the same susceptibility to disease 
and the same ability to transmit disease (see e.g. Kermack and McKendrick, 1927, and 
Bailey, 1975). Such assumptions rarely reflect reality, and during the last thirty years 
or so, considerable effort has been focused towards modelling different heterogeneities in 
the community in question and their effects on disease propagation (e.g. Anderson and 
May, 1991, Diekmann and Heesterbeek, 2000, Keeling and Rohani, 2007, and references 
therein). Heterogeneities can, broadly speaking, be separated in two different types: 
individual heterogeneities (e.g. susceptibility and infectivity) and social heterogeneities 
caused by the contact structures in the community (e.g. children attending schools). For 
stochastic models, which are the focus of the present paper, the first type of heterogeneity 
is usually represented by multitype epidemics (e.g. Anderson and Britton, 2000, Chap- 
ter 6). Heterogeneities caused by social structures have been incorporated in relatively 
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simple models by assuming mixing at different levels, such as community mixing and 
mixing within households (Ball et ai, 1997), but also using random network models (e.g. 
Andersson, 1999, Britton and O'Neill, 2002, and Newman, 2003). 

More recently, there has been considerable interest in understanding disease spread by 
using models that attempt to capture the essential features of real-life human populations 
(e.g. Ferguson et al, 2005, Longini et al, 2005, Halloran et al., 2008). Mathematical mod- 
els used to address these issues are typically quite complex, incorporating information of 
various sorts that may affect disease propagation in a given community. Examples include 
population age structure, school sizes and household sizes, geographic locations of villages 
and hospitals, information about travel (local and national), disease characteristics such 
as latent and infectious periods, possible interventions, and much more. Models of this 
kind are then studied via intensive computer simulation to identify key features of disease 
propagation and mitigation. 

In order to make such complex models as realistic as possible, it is vitally important to 
assign plausible values to the many parameters in the model. Typically, some parameters 
are well-informed by data from existing studies, while others are not. An important ex- 
ample of the latter are parameters that govern mixing at an intermediate level, meaning 
neither within-household or population-at-large mixing. More precisely, not enough is 
known about the relative importance of disease transmission at schools and workplaces 
(and similar) as compared with transmission within households and more random type 
transmission at community level. However, such information is vital to public health pol- 
icy, since control measures such as school closures or restrictions on large public gatherings 
are aimed precisely at reducing such intermediate transmission (see e.g. Cauchemez et 
ai, 2009). A recent study in this area is Cauchemez et al. (2008) in which model-based 
methods are used to estimate the relative importance of transmission of influenza within 
schools from longitudinal endemic data by comparing the number of reported cases dur- 
ing school term with the number of reported cases during school holidays. The authors 
found that approximately 25% of all transmission among children came via schools, which 
underlines the potential importance of intermediate-level mixing in disease transmission. 

This paper has two main aims. The first is to establish statistical estimation procedures for 
models featuring an intermediate level of mixing, given data on a single outbreak (note this 
is distinct from the longitudinal data considered by Cauchemez et al, 2008). The second 
aim is to use these procedures to assess in broad terms what can and cannot be estimated 
from outbreak data, using simulated data. A stochastic epidemic model incorporating 
three levels of mixing (households, intermediate and community level) is defined. Inference 
procedures are then derived for model parameters assuming two possible kinds of data, 
namely final size data, consisting simply of case numbers, or complete observation of the 
epidemic process through time. These two kinds of data represent extremes corresponding 
to minimal or maximal observation, and we consider them in order to ascertain what can 
and cannot be estimated from actual observational data. 

We consider simulated data from two 3-level mixing scenarios: first where a community 

of households is separated into villages, and secondly where households consists of adults 
and children, the former going to workplaces and the latter to schools. In the first ex- 
ample all members of a household belong to the same group structure (village), whereas 
as in the second example members of a household belong to different group structures 
(schools/workplaces). For both scenarios the simulated data is then used to assess what 
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can be inferred in terms of the model parameters. 

The paper is structured as foUows. In Section 2 we define the epidemic model with three 
levels of mixing. In Section 3 the likelihood is derived for the case of complete observation 
of the epidemic, and inference procedures for final size data arc also discussed. Sections 4 
and 5 are devoted to the two specific examples of community structures mentioned above, 
and we finish with conclusions and discussion in Section 6. 

2 The model 

2.1 Definition and notation 

Consider a community consisting of N individuals. Each individual belongs to one house- 
hold and also to exactly one group of a specific type, such as a school or workplace. In 
the sequel we use the terminology household, group and community to refer to the three 
populations to which an individual belongs. Note that households may consist of different 
numbers of individuals, and the same applies to groups. 

Suppose that the households are labelled 1, . . . , n, and the different groups by 0, 1, . . . , J. 
Here group is a dummy group: individuals that do not belong to a group are said to 
belong to group 0. An individual may thus be thought of as being type meaning 
that they are in household i, and group j. As described below, these types create potential 
differences between individuals in terms of their mixing behaviour. However, individuals 
are otherwise assumed to be similar, meaning that all are equally susceptible to the disease 
and equally able to infect others. Thus the population is homogeneous apart from the 
mixing behaviour of individuals. Finally, note that the only potential difference between 
individuals in the same household is their group membership. 

The model is of SEIR (Susceptiblc-Exposed-Infective-Removed) type (Diekmann and 
Heesterbeek, 2000), meaning that at any point in time, each individual in the popu- 
lation is either susceptible, exposed, infective, or removed. Susceptible individuals have 
the potential to contract the disease. Exposed (or latent) individuals have been infected 
but are not yet capable of transmitting the disease to others. Infectives can transmit the 
disease to others, while removed (or recovered) individuals are no longer infectious, and 
moreover immune to further infections. 

An individual who becomes infected first enters the exposed (or latent) period whose du- 
ration is distributed according to some specified non-negative random variable T^^\ Fol- 
lowing this, the individual becomes infective, remaining so for a period of time distributed 
according to some specified random variable T^^^ with mean E(T(^)) = II. The exposed 
and infectious periods of a single individual and of different individuals are all assumed to 
be mutually independent. We denote by /e and //, respectively, the probability density 
(or mass, as appropriate) functions of the exposed and infectious periods, and denote by 
Fe and the corresponding survivor functions (F(t) = 1 - F{t) = J^°° f{s)ds, t > 0). 

During its infectious period an individual may make three types of contact according to 
various mutually independent Poisson processes of different rates as described shortly. All 
such contacts that take place with susceptible individuals result in the immediate infection 
of that individual, so that the contacted individual enters the exposed period. First, an 
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infectious individual has contacts with each household member independently at times 
given by the points of a homogeneous Poisson process of rate Ah- Second, if the infectious 
individual belongs to group j, then the individual has contacts with each individual in the 
same group according to a Poisson process of rate Xq^ /rij, where rij denotes the number 
of individuals in the group. We set Aq^ = because group is a dummy group. Note 
that the contact rate in different groups can vary. Finally, an infectious individual also 
has contacts with each individual in the entire community according to a Poisson process 
of rate Xq/N. 

Initially, the population consists of one or a few exposed or infective individuals, with all 
other individuals being susceptible (initially immune individuals are simply be ignored). 
The epidemic continues until there are no exposed or infective individuals present in the 
community. Each individual is then either still susceptible, or else they have been infected 
and have recovered. 

Note that the parameters Ac and Aq^ represent the overall rates that an individual (in 

group j) has community and group contacts, respectively. Conversely, the overall rate of 
household contacts is {h — 1)Ah in a household of size h, following the convention of Ball 
et al. (1997). 

2.2 Threshold behaviour 

Stochastic epidemic models typically exhibit threshold behaviour, meaning that in a large 
population, epidemics either die out quickly, or else may infect a non-negligible fraction 
of the population with positive probability (e.g. Andcrsson and Britton, 2000). Moreover, 
this dichotomy is characterised via some threshold parameter i?*, itself a function of the 
model parameters, with — 1 the boundary between the two behaviour regimes: when 
i?* < 1 the epidemic will die out quickly with certainty whereas when i?* > 1 it can 
either die out quickly or else a non-negligible fraction of the population is infected. The 
quantity i?^,, which is a so-called reproduction number, is of key practical importance 
because control strategies typically aim to reduce its value to below the critical value of 
1. 

Threshold behaviour for the above model, when the population size N is large, can be 
derived in a manner similar to that for the multitype two-level mixing model defined in 
Ball and Lyne (2001). We now give a brief outhne of the argument; specifics are described 
later for the particular examples considered in this paper. 

Suppose that the population size, N , is assumed to become large in such a way that 
household sizes remain unchanged, but group sizes become large. Note that this can 
happen in various ways; for instance, the number of groups may be fixed, or may also 
increase. However, the number of different types of group, i.e. the number of distinct 
values of Aq\ is assumed to remain fixed. The key requirement is that, s& N ^ oo 
and for any fixed time t, the probability that a given household receives more than one 
infectious contact from outside the household before t tends to zero. This means that the 
process of infections between households can be regarded as a branching process in which 
an individual corresponds to a household and birth corresponds to infection, the point 
here being that each non-household infection that occurs will be with an individual in a 
previously uninfected household. 
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The threshold parameter for this branching process constitutes a threshold parameter 
for the epidemic model. In general, the branching process will be multitype, where the 
different types reflect the group of the first individual in a household to become infected as 
well as the group type of all household members (cf. Ball and Lyne, 2001). By defining rriij 
as the mean number of type j offspring from a type i individual in the branching process 
(where 'type' refers to the household structure just described), the threshold parameter 
equals the maximal eigenvalue of the matrix M = {rriij). 

In Example 1 of Section 4 below, we consider the simple case of equal-sized households and 
all groups having the same transmission rate Xq- In this setting the threshold parameter 
is simply 

R. = {Xc + XG)EiTA), 

where Ta is the final severity of a single isolated household containing one initial infective 
(i.e. the sum of the infectious periods of those ever infected). It can be shown (Ball et 
ai, 1997) that E{Ta) = nE{T), where T denotes the final number of individuals in the 
household ever infected, including the initial infective, and recalling that fi = E{T'^^''). 
Note also that the threshold parameter is independent of the exposed period. In Example 

2 in Section 5 we derive i?^, for a more complicated model. 

3 Statistical inference 

We now turn our attention to statistical inference, specifically considering two kinds of 
dataset. The first consists of complete temporal information, i.e. knowledge of the state 
of every individual in the population throughout the disease outbreak. The second type 
of data treated consists only of observation at the start and end of the epidemic, i.e. 
knowledge of who was initially susceptible and which of these individuals were infected 
during the outbreak. Our motivation for focussing on these two types of data is that they 
represent extreme scenarios. In practice, actual outbreak data is likely to be somewhere in 
between, for instance observing just removals through time, or weekly aggregates of case 
numbers. By considering the two data types described above, we can gain insight into 
what can and cannot be estimated even in extreme cases, and also evaluate the benefit of 
more detailed data collection. 

Finally, it is also assumed that the social structures are known, i.e. that we know which 
household and secondary grouping each individual belongs to. In practice this assumption 
is not unreasonable, but even so our methods could be extended to take account of missing 
data on social structures. 

3.1 Likelihood based inference of complete data 

We now derive both the likelihood for the complete data and maximum-likelihood es- 
timates for the parameters using counting process theory (e.g. Andersen et ai, 1993). 
For t > let I{t), Ii^{t) and Ifit) denote the number of infective individuals in the 
community, in household i and in group j, respectively, at time t. Let S{t), Sf{t) 
and Sf{t) denote the corresponding susceptible numbers at time t, and further define 
S^J {t) as the number who are in both household i and group j, and susceptible at time 
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t. Thus ^j=o '^i.j (^) ~ ^i^i^)- the notation t— to denote the left hmit, e.g. 

I{t—) = hms|f/(s). Let tij^r denote the time of the rth infection among individuals in 
household i belonging to group j, with the convention that tij^r = oo when no such infec- 
tion occurs. The length of the corresponding infected individual's exposed and infectious 
periods are denoted Tj^j^ and T-^r^ respectively. Assuming that the period of observation 
is [0,t], the likelihood is given by 
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From ([T]) it is possible to derive maximum likelihood (ML) estimates for the contact 
rates {Ah, Xq^', j = 1, . . . , J, Ac}, as described below. It is also straightforward to 
obtain estimates of the parameters of the exposed and infectious period distributions, since 
from ([1]) the complete data provide (potentially censored) independent and identically 
distributed observations from the two distributions. The last product (containing the 
factors x(i,j, r, t)) carries all information about the latent and infectious periods. Thus, 
when focus lies in making inference about transmission parameters or when the infectious 
and latent distributions are known, this product can be neglected. 

To make inference of the contact parameters we use the log-likelihood i = ln(L) and 
differentiate it with respect to each parameter separately, yielding 
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Note that the first summation in the expression for di/dX^ does not extend over j. To 
obtain the ML-estimates these equations are set equal to and are then solved in terms 
of the parameters. There are J + 2 equations and equally many unknowns (parameters). 
Quite often in large communities several groups are of the same type, such as schools or 
villages, thus having the same Xg- Then the corresponding partial derivatives should be 
summed up thus reducing the number of equations to two plus the number of different 
types of groupings. If this number is small, maximum likelihood estimates are easy to 
obtain numerically. 

Finally, Bayesian inference is straightforward given the likelihood defined at ([T]), for in- 
stance by using a Metropolis-Hastings algorithm to obtain approximate samples from the 
joint posterior distribution of the unknown infection rate parameters. Details of such 
inference will be given in the specific examples below. 

3.2 Inference for final size data 

In contrast to complete temporal data, we also consider final size data. Such data consist 
simply of case numbers at the end of the epidemic outbreak, so that there is no explicit 
temporal information. For our three-level mixing model, statistical inference based on final 
size data is generally far more challenging than for complete temporal data, the reason 
being that it is often impractical to evaluate (both analytically and numerically) the 
required likelihood function. Although methods exist to overcome this problem (notably 
data augmentation MCMC methods, see e.g. Demiris and O'Neill, 2005 and O'Neill, 2009), 
here we shall adopt a simpler approach by using an approximation in which households 
behave independently of one another. Such approximations are common in the inference 
literature for two-level mixing models (as discussed in Demiris and O'Neill, 2005), and are 
frequently reasonable in practice, especially in large populations. The details are given 
below. 

3.3 Latent and infectious periods 

As mentioned above, with complete data it is a simple matter to perform statistical infer- 
ence for parameters governing the latent and infectious period distributions. Conversely, 
given final size data it is impossible to estimate latent period parameters, since the final 
size distribution is itself invariant to the choice of latent period (see e.g. Ball et al, 1997). 
Moreover, the final size distribution is not greatly affected by the choice of infectious pe- 
riod distribution other than through its mean, and so most realistic choices of infectious 
period distribution result do not materially affect inferences for infection rate parameters 
(see e.g. O'Neill et al, 2000). In view of these facts, in the numerical illustrations in the 
sequel we shall assume that both latent and infectious periods are simply fixed, i.e. non- 
random. In particular, this means that estimation of infection rate parameters cannot be 
confounded by uncertainty in the estimation of latent or infectious period distribution pa- 
rameters. However, some derivations of quantities of interest (pseudolikelihoods etc.) will 
be given in the general case, i.e. arbitrary distributions for latent and infectious periods. 
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4 Example 1: Households of size 2 in villages 



We start with a fairly simple example that still allows us to explore questions of interest, 
such as what can be estimated, and with what precision. 

4.1 Model and threshold parameter 

Consider a population of n households of size 2, all individuals being of the same type. 
There are m villages, all of the same size, and we assume further that the mixing rate 
within villages is the same for different villages, so Xq^ = Xq for all j. It follows that 
N = 2n and that the villages consist of N/m individuals each. 

As mentioned in Section 2.2, the threshold parameter for this model is given by i?* = 
(Ac + AG)Ai-E(T), where T is the total number of infected individuals in a household in 
which one individual becomes infected. It follows that 

R, = (Ac + AG)Af(l+PH) 

where pH = (1 — E(e~^"^'^')) is the probability that a single infected individual infects 
the other individual in its household via household transmission. 

4.2 Approximate statistical inference from final size data 

Assume now that an outbreak in the community has occurred and that the final size has 
been observed. Since all households contain two individuals who have the same group 
membership, the data can be summarized as n = {nj = {nQ\ri(\n2^) : j = 1, . . . ,m}, 
where n^^^ denotes the number of households in village j in which k individuals were 
infected during the epidemic outbreak. 

Performing inference for Ac, Xq and Ah (or equivalently pn) is possible in the Bayesian 
context by adopting the random graph imputation approach described in Demiris and 
O'Neill (2005). However, here we focus on a much quicker and simpler approach that 
nevertheless enables us to address the question of what can be estimated. 

For j = 1, . . . ,m, let Zj denote the number of ultimately infected individuals in village 
j, so that Zj = Xlfc=o^"'fc '*' define Zj = Zj/{N/m) as the corresponding propor- 
tion of infected individuals in the village. Similarly, define the proportion of the entire 
community that is ultimately infected by Z = X^^Li-^i/-^- By neglecting the depen- 
dence between households we can obtain a pseudolikclihood for the data as follows. First 
note that the probability that an individual avoids infection from a single infective via 
community infection is E[e'''^'^'^'''^ ^^]. As described above, we assume for simplicity that 
T^^^ = H is nonrandom. It then follows that the probability that a susceptible individual 
avoids community infection from k infectives is e'~^'^^^^^ . By neglecting the dependencies 
inherent in the model, it follows that the probability that j susceptibles avoid community 
infection from NZ infectives is ~ ttq , say, where ttc e^^'^'^. Similar argu- 

ments hold for group infections, and we define ttg = e~^°^. The pseudolikelihood for the 
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data using the new parameters is therefore 

m 

L({n,}-i; PH,vrc,7rG) = nKT" [2vr,(l - vr,-)(l -PH)r [2vr,(l - vt^h + (1 - vr,m , 

i=i 

(2) 

where for j = 1, . . . ,m, Hj = e~('^cM^+->^GM^j) = ttqI^q is the approximate probabihty that 
an individual in group j avoids infection from both group and community. Finally, note 
that -R=K = — (logvTc + log7rG)(l + Ph) when written in terms of the new parameters. 

4.3 Bayesian inference 

Bayesian analyses of both the complete and final size data were performed using MCMC 
methods as described below. For the final size data, the three model parameters pu, ttq 
and vTc were assigned independent f/(0, 1) prior distributions. For the complete data, it 
is more natural to work with the original model parameters and the likelihood at ([T]). 
The prior distributions on the original parameters were set to be independent exponential 
with mean 1, since these are equivalent to the U{0, 1) prior distributions on the new 
parameters. For both complete and final size data, Bayesian inference is based on the 
joint posterior density of the model parameters given the data. This density is defined, 
up to proportionality, by the product of the prior density and the appropriate likehhood, 
namely ([T]) for the complete data and the pseudolikelihood ([2]) for the final size data. 

For both complete and final size data, analysis of the posterior density of interest was 
performed using a Metropolis-Hastings algorithm in which each of the three model param- 
eters was updated separately. For final size data, the proposal distribution was U{0, 1), 
which gave adequate mixing in practice. For the complete data, a random-walk algo- 
rithm was used in which the proposal distribution was Gaussian, centered on the current 
parameter value. 

4.4 Results 

We present results using two different simulated datasets, described below. In both cases 
we assume that there are m = 4 identical villages, each consisting of 500 households of 
size two, that the infectious period is Tj = fi = 1 and that the epidemic outbreak is 
initiated by one individual (in village 1) being infected from outside. 

Dataset 1.1 The first data set was simulated from the model with Ah = 0.3, Aq = 1.4 and 
Ac = 0.001, which yields true parameter values pn = 0.259, ttq = 0.247 and hq = 0.999. It 
follows that the threshold parameter is = (Ac + AG)/i(l +Pr) = 1.763. The parameter 
values were chosen to reflect a community in which mixing is quite high within villages 
and much less between villages. 

The simulated outbreak (itself a fairly typical outbreak for the chosen parameter values) 
resulted in the following final outcome data set: 

ni = (70, 157,273), na = (65, 178, 257), ng = (500, 0, 0), n4 = (500, 0, 0). 
Thus ng^-* = 70, n^^^ = 157, n''2^ = 273, for example. 
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Table 1: Posterior density summaries and ML estimates, Dataset 1.1 





True 




Complete data 






Final size data 






value 


Mean 


S. Dev. 


Median 


MLE 


Mean 


S. Dev. 


Median 


MLE 


Ph 


0.259 


0.264 


0.017 


0.264 


0.263 


0.277 


0.037 


0.278 


0.279 




0.247 


0.232 


0.010 


0.233 


0.232 


0.238 


0.014 


0.238 


0.238 


TTc 


0.999 


0.998 


0.002 


0.998 


0.999 


0.999 


0.001 


0.999 


1.000 


-R* 


1.763 


1.847 


0.058 


1.846 


1.847 


1.836 


0.062 


1.835 


1.836 



Table 2: Posterior correlations, Dataset 1.1 





Complete data 


Final size data 


P{PH, TTg) 


0.069 


0.57 


p{Ph, T^c) 


0.00037 


0.0014 


p(vrG,7rc) 


-0.0018 


-0.0080 



Dataset 1.2 The second data set was simulated from the model with Ah = 0.3, Ag = 
Ac = 0.6 giving true parameter values pn = 0.259, vtg = vtc = 0.549 and threshold 
parameter = 1.511. The difference as compared to the parameter values of the first 
dataset is that now the mixing within villages has decreased and community mixing has 
increased. 

The simulated dataset, again typical, was 

ni = (137, 180,183), na = (114, 182, 204), ns = (128, 177, 195), n4 = (126, 188, 186). 

In contrast to dataset 1.1, here each village undergoes a similar outbreak. 

General remarks on estimation Tables [1] - H] contain the results of the MCMC analyses 
and maximum likelihood estimates for the two datasets. The true parameter values are 
given for reference, although since inference is based on just one simulation it follows that 
point estimates are not expected to be identical to true values. 

The estimates in Table [1] illustrate that final size data alone can be sufficient to yield rea- 
sonable estimates of the three model parameters. In addition, the approximate approach 
to inference using a pseudolikelihood for the final size data appears to be effective in this 
case. As would be expected, complete data based estimates are usually more precise than 
final size data based estimates in the sense that the former have lower posterior standard 



Table 3: Posterior density summaries and ML estimates, Dataset 1.2 





True 




Complete data 






Final size data 






value 


Mean 


S. Dev. 


Median 


MLE 


Mean 


S. Dev. 


Median 


MLE 


Ph 


0.259 


0.270 


0.0012 


0.270 


0.269 


0.272 


0.021 


0.272 


0.272 




0.549 


0.529 


0.046 


0.529 


0.525 


0.498 


0.173 


0.451 


0.260 


TTc 


0.549 


0.563 


0.049 


0.561 


0.565 


0.658 


0.195 


0.656 


1.000 




1.511 


1.545 


0.038 


1.545 


1.542 


1.550 


0.040 


1.549 


1.713 
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Table 4: Posterior correlations, Dataset 1.2 



Complete data 



Final size data 



p(7rG,7rc) 



0.012 
0.0032 
-0.94 



0.025 
0.012 
-0.95 



deviations than the latter. The posterior standard deviations of the model parameters 
are reduced by approximately 50% for dataset 1.1 and even more for dataset 1.2. Finally, 

both point and uncertainty estimation for the threshold parameter R^, is similar for both 
kinds of data, illustrating that inference about i?* can be effectively performed with final 
size data alone. Similar findings for standard SIR models are described in Clancy and 



Data heterogeneity affects estimation A key finding is that the extent to which the three 
model parameters can be individually estimated is dependent upon the between-village 
diversity in the data. In dataset 1.1, two villages undergo outbreaks whilst the other 
two remain completely unaffected. Conversely, in dataset 1.2 all four villages experience 
similar outbreaks. Intuitively, this similarity makes it harder to differentiate the roles 
of ttg and ttc, since either group or community infection, or both, could be driving the 
epidemic. This is reflected in the fact that the posterior correlation p(7rG,7rc) for dataset 
1.2 is close to -1 for both flnal size and complete data based estimation. Conversely, in 
dataset 1.1 the corresponding posterior correlations are far lower, illustrating that here it 
is feasible to distinguish between group and community infection. 

Maximum likelihood vs. Bayesian estim,ation It is interesting to note that maximum 
likelihood estimation performs poorly for dataset 1.2 when considering final size data, 
where the estimates for ttc and ttq are nothing like either the true values or the Bayesian 
posterior averages. It seems likely that this occurs due to the shape of the hkelihood 
surface, itself a consequence of the correlation between ttq and ttc discussed above. These 
findings also highlight the need for caution in interpreting ML estimates. 

Are three levels of mixing really necessary? It is natural to compare the differences 
in inference based on three-level mixing models with that based on a two-level mixing 
model in which group-level mixing is ignored, i.e. the within-group infection rate is set 
to zero. For dataset 1.1, by setting ttg = the final size analysis yields £'[pij|n] = 0.549, 
£'[7rc|»^] = 0.549 and £'[i?^,|n] = 1.26. Two points should be noted. First, even though 
we still have the same household-level data, the estimate of the within-household trans- 
mission probability pn is markedly different under the assumption of a two-level mixing 
model. Second, one might at least hope that inference for the threshold parameter 
was relatively robust to the choice of model, but again the choice of model has, in this 
considerable impact. 



5 Example 2: Households and school/ workplaces 



In our second example we consider a population divided into households of equal size, 
where each household contains individuals belonging to one of two kinds of groups, which 
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represent schools and workplaces. 



5.1 Model and threshold parameter 

The mixing structure of the model is defined as follows. The population is partitioned into 
n = 500 households, all of size 4. There are two secondary group structures, type 1 being 
schools with contact rate Aq"* and type 2 being workplaces with contact rates Xq\ In 
each household two individuals go to school and two go to a workplace. For convenience 
we refer to the former two individuals as children, and the latter two as adults, one male 
and one female. There are mi = 10 schools, each of size 100 (= 2n/mi), and m2 = 40 
workplaces each of size 25 (= 2n/m2). 

We assumed the following allocation of children to schools and adults to workplaces. 
Suppose that the households, schools and workplaces are numbered (e.g. 1,2,..., 500 
for households). Children are allocated to schools such that school 1 is populated by 
all children from households 1-50, school 2 by all children from households 51-100, etc. 
(so siblings are in the same school). The two adults in each household are allocated to 
different workplaces: in each of households 1-25 the two adults go to workplaces 1 and 
21 respectively; in each of households 26-50 the two adults go to workplaces 2 and 22 
respectively, etc. Obviously, numerous other allocations (both specified and at-random) 
are possible but the exact choice of allocation seems unlikely to have a material impact 
on model parameter estimation. 

We now derive the threshold parameter i?* for this model by considering the approximat- 
ing branching process in which an individual corresponds to a household in the epidemic 
model. Consider first the epidemic within a household, meaning that group and commu- 
nity transmission is ignored. Recall that within-household transmission is defined such 
that there is no difference between adults and children. The final size distribution in 
such a household, given the number of intially infective individuals, can be derived from 
standard recursive formulae (e.g. Andersson & Britton, 2000, p 16). For s > 0, i > 1 
and A; = 0, . . . , s, let ps\k) denote the probability that exactly k oi s initially susceptible 
individuals become infected during the course of the household epidemic initiated by i in- 
fective individuals. Further, let /is — X]fc=o ^P^^^ (^) denote the mean number of initially 
susceptible individuals who ever become infected. 

During the early stages of the epidemic, with high probability each household receives at 
most one external infectious contact. The average total number of individuals infected 
in a household in which one was externally infected is therefore 1 + fi^^K We define a 
household to be type 1 if the externally infected individual is a child and type 2 if it is an 
adult. 

We now derive the mean offspring matrix M = {rriij), where rriij denotes the expected 
number of type j households that one type i household infects, starting with mn. The av- 
erage number of infected individiials in a type 1 household is l + iJ,^\ All these individuals 
have global contacts at rate Ac with randomly chosen individuals in the community, so 
the community contact rate with children is Ac/2 because half of the community are chil- 
dren. The mean length of the infectious period is fi — E{T^^^), so the expected number of 
type 1 households infected through community contacts equals (1 -|- /X3^'*)/xAc/2. It is also 
possible to infect other type 1 households through school contacts. The expected number 
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of children infected in the type 1 household equals 1 + jJ.^^^ /S, i.e. the externally infected 
child, plus one third of the mean number of susceptibles infected^ since they comprise one 
child and two adults. Each infected child infects on average hXq other children in their 
school. Under the assumption that schools are large, two infected children in a household 
will each infect different individuals at school with high probability. The average num- 
ber of new type 1 households the type 1 household infects through school contacts hence 
equals (1 + fi^^^ /3)fiXQ\ The community and school terms together make up mn. The 
remaining rriij can be obtained similarly, yielding 

M=( (1 + -"s'V^ + (1 + (1 + 4^Vf + ^/^Ag) , 

I (1 + + ^/.Ag) (1 + ,^^),^-^ + (1 + f ' ■ 

The reproduction number R^, is the largest eigenvalue of M, so that 



mil +17122 , {rnii~m22Y , 

R* = ^ + Y ^ + ^1277121, (4) 



where m,,- is defined in Equation 



5.2 Approximate statistical inference from final size data 

We now derive a pseudolikelihood for the final size in a similar manner to that described 
for Example 1 above. Label the schools 1 to mi (=10) and the work places 1 to m2 
(=40). Recall that there are n = 500 households and the community size is = 2000. 
All schools have size Ng = 100 and all workplaces have size N^^ = 25. Let ni*'* denote the 
number of children in school i who ever became infected and the number of adults 
in workplace j who ever became infected [i = 1, . . . , mi, j = 1, . . . , ^2). Let denote 
the total number of individuals in the community who ever become infected. All of the 
quantities defined in this paragraph are known from the final size data. 

Label the households 1 to n (=500) and let kc{h), kf{h) and km{h) respectively denote 
the school of the children, the work place of the female, and the workplace of the male 
in household h. The respective probabilities that these individuals avoid both group and 
community infection are, approximately, 

/ {h\ ( f \ . \ \ nc/N^ (1)n„(*-W)/7V, 

iJc{h) = exp I -fi I Ac— + A^^^^ 1 1 = TTc^' (vr^O ' , 
ijf{h) = exp I -n I Ac— + Ag 1 1 = (^G ) ^ % 

(/ (fcm(/l))\ \ 

the first expression being the probability that a given child in the household avoids external 
infection, etc, and where similarly to the previous section, ttc = e~'^'^^ and vrg'* = e~^G ^ 
(j = 1, 2). We also define = 1 — e~^^^ as the between-individual transmission proba- 
bility within a household, which is the same for all households. 
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We now derive an expression for the probability Ph{i,j, k) that exactly i of the children, 
j females and k males in household h become infected {i = 0, 1,2, j = 0,1, k = 0, 1). 
The expression is approximate because it explicitly depends upon the numbers ultimately 
infected in the school and work places of individuals in household h, and in the community. 
Define as the random vector of numbers of children, females and males ultimately 
infected in household h, so that ph{i,j,k) = P{Ah = {i,j,k)). To compute this, define 
tth, the random vector denoting the numbers of children, females and males infected from 
group or community infection in household h, so that ah = (r, s, t) denotes the event that 
these numbers are r, s and t, respectively. Then 

Ph{i,j,k)= P{ah = {r,s,t))P{Ah = {i,j,k)\ah = {r,s,t)). (5) 

0<r <i 
0<s<j 
0<t<k 

The first factor on the right hand side of ([5]) is 

P(a, = (r, s, t)) = (^^ Mhf-' (1 - Mh)Y MhY'' (1 - Mh))' ^m{hf-' (1 - i^m{h))' . 

For the second factor on the right hand side of ([5]), recall that the within-household 
epidemic is homogeneous in the sense that children and adults behave identically. Now if 
i individuals are externally infected, the probability of a final size of k among the s = A — i 
initially susceptible individuals is just p^li^{k), where p^s\k) was defined in Section 5.1. 
Adjusting this probability to incorporate the extra information on which types (children, 
female and male) were externally infected and which were eventually infected amounts to 
multiplying by appropriate combinatorial factors. It follows that 

P{A, = k)\a, = (r, t)) = p^XVl^' -r + j-s + k- t)^f^^^^^^. 

\i+j+k-{r+s+t)) 

Note that p't\k) is a function of the parameter j9h = 1 — e^'^"'^. 

If the final size data are summarized by {{ih,jh, kh)}h=i, then the pseudolikelihood, which 
treats households as independent, is 

n 

L{{{'ih,jh,kh)}l=i;7^c,7^G\'^G\p^) = YlPh{ih,jh,kh). 

h=l 

For both complete and final size data, we proceed as for Example 1, i.e. using similar 
MCMC methods to perform the Bayesian analyses. 



5.3 Results 

As for Example 1, we consider two different datasets which illustrate various key findings. 
In both cases the simulated outbreak is initiated by one index case who is a child. For 
j = 0, . . . , 4, let Tij be the number of households with final size j, so that ^^=o^i = ^ = 
500. 



14 



Table 5: Posterior density summaries and ML estimates, Dataset 2.1 





True 




Complete data 






Final size data 






value 


Mean 


S. Dev. 


Median 


SS/lLCj 


Mean 


S. Dev. 


Median 


Mljjjj 


Ph 


0.259 


0.282 


0.010 


0.282 


0.282 


0.268 


0.019 


0.268 


0.268 




0.301 


0.289 


0.016 


0.289 


0.289 


0.288 


0.031 


0.285 


0.268 


^G 


0.544 


0.527 


0.023 


0.527 


0.528 


0.482 


0.055 


0.476 


0.445 


TTc 


0.951 


0.951 


0.001 


0.952 


0.953 


0.933 


0.062 


0.952 


1.000 


-R* 


2.082 


2.280 


0.092 


2.278 


2.268 


2.339 


0.107 


2.337 


2.340 



Table 6: Posterior correlations, Dataset 2.1 





Complete data 


Final size data 


P{PH, T^G^) 


0.063 


0.30 


P{PH,T^G^) 


0.072 


0.42 


piPH, VTc) 


0.0067 


0.020 


/ (1) (2)n 


0.015 


0.53 


P{^G^,T^C) 


-0.085 


-0.63 


p{^S\t^c) 


-0.071 


-0.68 



Dataset 2.1 The first dataset was simulated using parameter values Ah = 0.3, Aq = 1.2, 
Ag) = 0.6, and Ac = 0.05. It follows that pn = 0.259, vrjj^ = 0.301, Trg^ = 0.549, 
TTc = 0.951, and i?* = 2.08. The parameter choices reflect high mixing in schools and 
moderate mixing at workplaces, households and the community at large. 

In the simulated epidemic, being a typical one for the chosen parameter values, a total of 
1475 individuals were infected, comprising 793 children, 333 females and 349 males, while 
(no, ni, ?T,2, ""-3, = (25,40,79,147,209). Note that the difference in numbers of males 
and females infected is purely random, since both genders experience the same kind of 
mixing structure under the assumptions of the model. 

Dataset 2.2 The second dataset was simulated using identical parameter values to dataset 
2.1 except that now Ac = 0.005, so that hq = 0.995 and R^^ = 1.99. As a consequence, 
the community mixing is now much lower. 

The resulting simulated epidemic (also being typical) was far less severe, with 278 infected 
comprising 154 children, 68 females and 56 males, and (no, ni, n2, ns, n4) = (401, 13, 20, 39, 27). 

General Remarks Parameter estimates and correlations from analyses of the two datasets 
are listed Tables [5] - [H Both final size and complete data analyses yield broadly similar 
estimates, themselves in line with what might be expected given the true values, all of 
which suggests that the methods of inference are themselves effective. Posterior stan- 
dard deviations are lower for complete data, especially for the community and workplace 
parameters, suggesting that temporal data have particular benefit when it comes to es- 
timating parameters associated with less common forms of transmission. The gain in 
precision from having complete data is not as great when it comes to estimation of -R*. 

Attribution of infection source The results from Dataset 2.1 in Table [5] illustrate that tem- 
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Table 7: Posterior density summaries and ML estimates, Dataset 2.2 





True 




Complete data 






Final size data 






value 


Mean 


S. Dev. 


Median 


SS/LLrj 


Mean 


S. Dev. 


Median 


Mijil/ 


Ph 


0.259 


0.247 


0.021 


0.246 


0.246 


0.237 


0.032 


0.238 


0.235 




0.301 


0.292 


0.036 


0.292 


0.293 


0.300 


0.045 


0.299 


0.296 


^G 


0.544 


0.538 


0.049 


0.524 


0.541 


0.526 


0.076 


0.523 


0.519 


TTc 


0.995 


0.987 


0.008 


0.989 


0.991 


0.995 


0.047 


0.997 


1.000 


-R* 


1.992 


2.02 


0.185 


2.016 


1.989 


1.951 


0.192 


1.943 


1.940 



Table 8: Posterior correlations, Dataset 2.2 





Complete data 


Final size data 




0.068 


0.29 


p{Ph,t^''g) 


0.050 


0.53 


piPH, VTc) 


0.003 


0.003 


/ (1) (2)n 


0.005 


0.056 


P{^G^,T^C) 


-0.016 


-0.017 


p{^S\t^c) 


-0.006 


-0.024 



poral data can give a more accurate picture of where transmission is occurring. Specifi- 
cally, here the final size analysis attributes less infection to within- household contact, and 
more to school and community contact, than the complete data analysis. The correlations 
in Table [6] reinforce this finding, with all but one of the final size analysis correlations 
being of magnitude 0.3 or greater. As for Example 1, posterior correlations are usually 
much smaller for complete data compared to final size data, illustrating the utility of 
temporal data collection. 

How does outbreak size affect estimation? Datasets 2.1 and 2.2 describe very different 
epidemics: the former has attack rate (i.e. proportion infected) 74%, the latter 14%. 
Intuitively one would expect that more cases yield more information about the model 
parameters. This is borne out by comparing the posterior standard deviations in Tables [5] 
and [71 the latter generally being clearly larger than the former for both complete and final 
size data. The one exception to this is the final size estimate of the community parameter 
TTc, which is estimated with higher precision given final size data in the smaller outbreak 
rather than the larger. One possible explanation for this arises from a more detailed 
consideration of Dataset 2.2, and in particular the fact that the outbreak was confined 
to particular subsets of the population. Specifically, only three of the ten schools had 
any cases, and all of the adult cases arose in households with children at these schools. 
Due to the structure of the population, individuals in households whose children did not 
attend the infected schools had only community-level contact with infected individuals. 
Consequently, 1400 individuals in the data are known to have avoided infection due to 
community sources, leading to more precise estimation of ttq. Conversely, in Dataset 2.1 
the outbreak is much larger and so it is not possible to be so unambiguous about how 
individuals escaped infection in the final size analysis. Note that although this argument 
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could also be applied to the complete data analysis, in that case there is already far more 
information about potential sources of infection, which would explain why vrc is estimated 
with greater precision in the larger outbreak. 

Data heterogeneity As for Example 1, the posterior correlations in Tables [6] and [8], along 
with the arguments concerning hq estimation above, illustrate the fact that heterogeneity 
in the data can help estimation. The present Example shows that this conclusion can still 
hold even when comparing outbreaks of very different sizes. 

6 Discussion 

Although this paper is purely methodological, our aim has been to address questions 
that are relevant for actual applied studies. In particular we have tried to address broad 
questions of interest that are raised naturally when one considers epidemic models which 
feature intermediate mixing levels. 

We have derived methods for estimation based on final outcome and complete data, and 
attempted to assess how different kinds of data enable estimation of key model parameters. 
We found that temporal data can be of considerable benefit compared to final outcome 
data, both in terms of the precision of estimates and in terms of distinguishing different 
routes of infection. In particular it was shown that having temporal information about the 
epidemic outbreak makes it easier to distinguish community spread from that occurring 
within secondary group structures. However, the extent to which the data are themselves 
homogeneous is also an important factor in estimation. In particular, heterogeneity in 
the data generally appears to improve estimation. 

We have shown that estimation of the threshold parameter can be materially changed 
when the intermediate level of mixing is ignored. This in turn implies that estimation of 
vaccine coverage levels would also be affected by the presence or otherwise of intermediate 
level mixing, since such levels are usually determined by equating the threshold parameter 
to unity. 

As with any model, some of our assumptions could be made more realistic. For example, 
when considering adults and children it would be natural to allow these two types of 
individual to have differing susceptibility or infectivity. In a similar vein we have not 
taken account of any prior complete or partial immunity in the population. Nevertheless 
it seems likely that the broad qualitative findings are unlikely to be affected by such 
generalisations to the basic model considered here. 
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